week 2 exercise - part 1

Basic visualization with Matplotlib

Matplotlib

First, we import the required libraries, using standard conventions: NumPy for all our mathematical needs, Matplotlib as the plotting library, and pyplot, which gives an easy API to create plots with Matplotlib. Seaborn is imported here as well and will be introduced later.

In [1]:
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns

# we need the following line to indicate that the plots should be shown inline with the Jupyter notebook.
%matplotlib inline 

We start with a simple plot of a mathematical function. We create a NumPy array of x-values and compute the y-value, i.e. the function value, for each of them. Plotting the function is then as easy as passing the x and y values to plt.plot.

In [2]:
X = np.linspace(-np.pi, np.pi, 100) # define a NumPy array with 100 points in the range -Pi to Pi
Y = np.sin(X)  # define the curve Y by the sine of X

plt.plot(X,Y); # use matplotlib to plot the function

Such a plot is perfectly fine while you are exploring data, but in your final notebook a plot without annotations is hard for the reader to understand. With Matplotlib it is very easy to add labels, a title and a legend. You can also change the limits of the plot, the style of the lines and much more.

The following could be seen as the bare minimum for a plot to be understood as part of reproducible research.

In [3]:
plt.plot(X, Y, 'r--', linewidth=2)
plt.plot(X, Y/2, 'b-', linewidth=2)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Plot Title')
plt.xlim(-4, 4)
plt.ylim(-1.2, 1.2)
plt.legend(['red curve', 'blue curve'], loc='best')
Out[3]:
<matplotlib.legend.Legend at 0x259ff990380>

Go to the documentation pages of Matplotlib http://matplotlib.org/contents.html to find all the possible options for a plot and also to see more tutorials, videos and book chapters to help you along the way.

Another nice tutorial:

  • http://www.labri.fr/perso/nrougier/teaching/matplotlib/

This assignment first shows you how to download CSV data from an online source. We then explore a dataset of cities around the world and compare cities in the Netherlands to the rest of the world.

Loading CSV data with Pandas

We will work with a database of information about cities around the world:

https://dev.maxmind.com/geoip/geoip2/geolite2/

Data structures can be handled in many ways in Python. There are the standard Python arrays, lists and tuples. You can also use the arrays in the NumPy package, which allow you to do heavy math operations efficiently. For data analysis Pandas is often used, because data can be put into so-called dataframes. Dataframes store data with column and row names and can easily be manipulated and plotted. You will learn more about Pandas in the Machine Learning workshops. A short intro can be found here:

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
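
To make the difference between these data structures concrete, here is a minimal sketch. The values and the names `populations_list`, `frame` and `large` are made up for illustration and are not part of the cities dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values only; these are not taken from the dataset.
populations_list = [100.0, 250.0, 75.0]

# A plain list has no element-wise arithmetic, so we loop explicitly.
doubled_list = [2 * p for p in populations_list]

# A NumPy array applies arithmetic to every element at once.
populations_array = np.array(populations_list)
doubled_array = 2 * populations_array

# A DataFrame adds column names and a row index on top of the values.
frame = pd.DataFrame({"City": ["a", "b", "c"], "Population": populations_list})
large = frame[frame["Population"] > 90]  # filter rows by a named column

print(doubled_list)
print(doubled_array)
print(large)
```

The list and the array hold the same numbers, but only the array and the dataframe support operations on whole columns without an explicit loop.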

In [4]:
import os
import urllib.request

url = 'https://github.com/CODAIT/redrock/raw/master/twitter-decahose/src/main/resources/Location/'
filename = 'worldcitiespop.txt.gz'
datafolder = 'data/'
In [5]:
# download the compressed file into memory
with urllib.request.urlopen(url + filename) as downloaded:
    buf = downloaded.read()

# create the data folder if it does not exist yet
os.makedirs(datafolder, exist_ok=True)

# write the downloaded bytes to disk
with open(datafolder + filename, 'wb') as f:
    f.write(buf)
In [6]:
import pandas as pd
# reading files may cause problems or give errors... Can you explain the use of the encoding parameter?
cities = pd.read_csv(datafolder + filename, sep=',', low_memory=False, encoding = 'ISO-8859-1')
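
As a hint for the question about the encoding parameter, the following sketch shows what can go wrong when bytes are decoded with a different encoding than the one used to write them. It assumes, as the `encoding` argument above suggests, that the file is stored as ISO-8859-1; the example string is only an illustration.

```python
text = "São Paulo"  # a city name with a non-ASCII character

latin1_bytes = text.encode("ISO-8859-1")  # 'ã' becomes the single byte 0xE3
utf8_bytes = text.encode("utf-8")         # 'ã' becomes the two bytes 0xC3 0xA3

# Decoding with the encoding that was used for writing recovers the text.
round_trip = latin1_bytes.decode("ISO-8859-1")

# Decoding the Latin-1 bytes as UTF-8 fails: 0xE3 announces a multi-byte
# sequence, but the bytes that follow are not valid continuation bytes.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding UTF-8 bytes as Latin-1 does not fail, but silently garbles the name.
garbled = utf8_bytes.decode("ISO-8859-1")

print(round_trip, utf8_failed, garbled)
```

The first mismatch raises an error, while the second one produces wrong characters without any warning, which is the harder problem to notice.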

Data Manipulation

We first drop the rows without a population value and then take a peek at the data by checking out the final rows. Do you see any potential problem with this dataset?

In [7]:
cities = cities.dropna(subset=['Population'])
cities.tail()
Out[7]:
         Country  City            AccentCity      Region  Population  Latitude    Longitude
3173646  zw       redcliffe       Redcliffe       06      38231.0     -19.033333  29.783333
3173676  zw       rusape          Rusape          04      23761.0     -18.533333  32.116667
3173737  zw       shurugwi        Shurugwi        07      17107.0     -19.666667  30.000000
3173892  zw       victoria falls  Victoria Falls  00      36702.0     -17.933333  25.833333
3173957  zw       zvishavane      Zvishavane      07      79876.0     -20.333333  30.033333
In [8]:
cities.sort_values(by='Population', ascending=False).head(20)
Out[8]:
         Country  City       AccentCity  Region  Population  Latitude    Longitude
1544449  jp       tokyo      Tokyo       40      31480498.0  35.685000   139.751389
570824   cn       shanghai   Shanghai    23      14608512.0  31.045556   121.399722
1327914  in       bombay     Bombay      16      12692717.0  18.975000   72.825833
2200161  pk       karachi    Karachi     05      11627378.0  24.905600   67.082200
1349146  in       new delhi  New Delhi   07      10928270.0  28.600000   77.200000
1331162  in       delhi      Delhi       07      10928270.0  28.666667   77.216667
2130459  ph       manila     Manila      D9      10443877.0  14.604200   120.982200
2461968  ru       moscow     Moscow      48      10381288.0  55.752222   37.615556
1626528  kr       seoul      Seoul       11      10323448.0  37.598500   126.978300
316800   br       sao paulo  São Paulo   27      10021437.0  -23.473293  -46.665803
2800596  tr       istanbul   Istanbul    34      9797536.0   41.018611   28.964722
2003442  ng       lagos      Lagos       05      8789133.0   6.453056    3.395833
1892345  mx       mexico     Mexico      09      8720916.0   19.434167   -99.138611
1186762  id       jakarta    Jakarta     04      8540306.0   -6.174444   106.829444
2990572  us       new york   New York    NY      8107916.0   40.714167   -74.006389
362418   cd       kinshasa   Kinshasa    06      7787832.0   -4.300000   15.300000
842667   eg       cairo      Cairo       11      7734602.0   30.050000   31.250000
2074194  pe       lima       Lima        15      7646786.0   -12.050000  -77.050000
553246   cn       peking     Peking      22      7480601.0   39.928889   116.388333
996635   gb       london     London      H9      7421228.0   51.514125   -0.093689

By sorting the cities on population we immediately see the entries of a few of the largest cities in the world.
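
The sorted table also hints at one possible data problem: New Delhi and Delhi appear as separate rows with exactly the same population. The sketch below shows one way such rows could be flagged. It uses a small made-up table (`toy`) with the same column names; the rows are hypothetical and the check is only a starting point, not a complete cleaning step.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the column layout of the cities table.
toy = pd.DataFrame({
    "Country": ["xx", "xx", "yy", "yy"],
    "City": ["alpha", "new alpha", "beta", "gamma"],
    "Population": [5000.0, 5000.0, np.nan, 1200.0],
})

# Keep only rows that have a population value.
toy = toy.dropna(subset=["Population"])

# Largest population first.
ranked = toy.sort_values(by="Population", ascending=False)

# Flag rows that share both country and population with another row.
suspect = ranked[ranked.duplicated(subset=["Country", "Population"], keep=False)]

print(ranked)
print(suspect)
```

With `keep=False` every member of a duplicate group is marked, so both spellings of the same place show up for inspection.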

Assignment 1a

To get an idea of where in the world the cities in the dataset are located, we want to make a scatter plot of the position of all the cities in the dataset.

Don't worry about drawing country borders, just plot the locations of the cities.

Remember to use all the basic plot elements you need to understand this plot.

In [9]:
# small markers keep the individual cities distinguishable
plt.scatter(cities['Longitude'], cities['Latitude'], s=1)
plt.title('Positions of Cities in the Dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

Assignment 1b

Create a visualization to show the top-20 cities with the highest population.

Remember to use all the basic plot elements you need to understand this plot.

In [10]:
top_cities = cities.sort_values(by='Population', ascending=False).head(20)
# all cities in gray as background, the 20 largest cities highlighted in red
plt.scatter(cities['Longitude'], cities['Latitude'], color='gray', label='all cities')
plt.scatter(top_cities['Longitude'], top_cities['Latitude'], color='red', label='top-20 by population')
plt.title('Positions of the 20 Most Populous Cities')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(loc='lower left')
plt.show()

Assignment 1c

Now we want to plot only the cities in the Netherlands. Use a scatter plot again, but now vary the size and the color of the marker with the population of that city.

Use a colorbar to show how the color of the marker relates to its population.

Use sensible limits for your axes so that you show only the mainland of the Netherlands (and not the Dutch Antilles).

In [11]:
## Your code and explanation in comments...
dutch_cities = cities[cities['Country'] == 'nl']
max_population = dutch_cities['Population'].max()
# marker area scales linearly with population; the largest city gets an area of 1000
size_marker = 1000 * dutch_cities['Population'] / max_population

plt.figure(figsize=[7, 7])
plt.scatter(dutch_cities['Longitude'], dutch_cities['Latitude'],
            s=size_marker, c=dutch_cities['Population'], cmap='plasma')
plt.colorbar(label='Population')
# limit the axes to the mainland of the Netherlands
plt.xlim(3.2, 7.5)
plt.ylim(50.6, 53.6)
plt.title('Positions of Dutch Cities in the Dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

Assignment 1d

Looking at the previous assignment, we can see larger cities such as Amsterdam, Rotterdam and even Eindhoven. But we still do not really have a clear overview of how many big cities there are. Create a visualisation to show the distribution of the population for all Dutch cities.

Add proper basic plot elements to this plot and add an annotation to indicate Amsterdam and Eindhoven in this distribution.

In [12]:
## Your code and explanation in comments...
dutch_cities = cities[cities['Country'] == 'nl']

plt.hist(dutch_cities['Population'], color='blue', edgecolor='black', bins=50)
plt.xlabel('Population')
plt.ylabel('Count')
plt.title('Population Distribution of Dutch Cities')
plt.grid(True)

# Add annotations for Amsterdam and Eindhoven
# .iloc[0] turns the one-row selection into a single number, which annotate expects
amsterdam_population = dutch_cities.loc[dutch_cities['City'] == 'amsterdam', 'Population'].iloc[0]
eindhoven_population = dutch_cities.loc[dutch_cities['City'] == 'eindhoven', 'Population'].iloc[0]

plt.annotate('Amsterdam', xy=(amsterdam_population, 3), xytext=(amsterdam_population - 100000, 15),
             arrowprops=dict(facecolor='black', arrowstyle='->'))
plt.annotate('Eindhoven', xy=(eindhoven_population, 3), xytext=(eindhoven_population + 1000, 15),
             arrowprops=dict(facecolor='black', arrowstyle='->'))

plt.show()

Assignment 1e

Now we want to see how the population distribution of Dutch cities compares to that of the entire world.

Use subplots to show the Dutch distribution (top plot) and the world distribution (bottom plot).

In [13]:
plt.figure(figsize=[20, 8])

# top plot: Dutch cities, population in thousands
plt.subplot(2, 1, 1)
plt.hist(np.asarray(dutch_cities.dropna().Population/1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.1)
plt.title('Distribution of Dutch Cities')
plt.ylabel('Density')

# Add the subplot of the world cities below this Dutch one
plt.subplot(2, 1, 2)
plt.hist(np.asarray(cities.dropna().Population/1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.1)
## Your code and explanation in comments...
plt.title('Distribution of World Cities')
plt.xlabel('Population (thousands)')
plt.ylabel('Density')
plt.tight_layout()
plt.show()
Assignment 1f

What conclusions can you draw from the plots above?

In [14]:
# Dutch cities seem to be spread more evenly over the population sizes, whereas for all cities in the world the distribution is concentrated at small populations.

Assignment 2

Create a data visualization to compare the top-3 largest cities for Japan, Germany and your own (home) country. Add a clear conclusion about the comparison.

In [15]:
cities_in_Japan = cities[cities['Country'] == 'jp']
cities_in_Germany = cities[cities['Country'] == 'de']
cities_in_Bulgaria = cities[cities['Country'] == 'bg']

plt.figure(figsize=[20, 12])

plt.subplot(3, 1, 1)
plt.hist(np.asarray(cities_in_Japan.dropna().Population/1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.15)
plt.title('Japan')

plt.subplot(3, 1, 2)
plt.hist(np.asarray(cities_in_Germany.dropna().Population/1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.15)
plt.title('Germany')

plt.subplot(3, 1, 3)
plt.hist(np.asarray(cities_in_Bulgaria.dropna().Population/1000), bins=np.arange(0, 200, 1), density=True)
plt.ylim(0, 0.15)
plt.title('Bulgaria')

plt.suptitle('Distribution of City Populations in Japan, Germany and Bulgaria', fontsize=16)
plt.xlabel('Population (thousands)', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.show()

# Comparing the population distributions of cities in Japan, Germany and Bulgaria shows a clear contrast. In Japan, most of the population
# lives in the bigger cities. Germany's population, on the other hand, is concentrated in middle-sized cities, and in Bulgaria most of the
# population lives in small cities. Going from Japan through Germany to Bulgaria, there is a downward trend in living in big cities.
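
The assignment asks for the three largest cities per country, which the histograms above do not show directly. A minimal sketch of how such a top-3 table could be built is given below. The table `toy` and its values are hypothetical; only the column names `Country`, `City` and `Population` are taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the real table has the same three columns.
toy = pd.DataFrame({
    "Country": ["jp"] * 4 + ["de"] * 4,
    "City": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "Population": [900.0, 300.0, 500.0, 100.0, 350.0, 120.0, 170.0, 60.0],
})

# Sort once, then keep the first three rows of every country.
top3 = (
    toy.sort_values(by="Population", ascending=False)
       .groupby("Country")
       .head(3)
)

# Number the cities within each country (1 = largest) and reshape to
# one row per country and one column per rank.
top3 = top3.assign(Rank=top3.groupby("Country").cumcount() + 1)
table = top3.pivot(index="Country", columns="Rank", values="Population")

print(table)
```

Calling `table.plot(kind='bar')` on the result would draw one group of three bars per country, which makes the largest cities directly comparable.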

week 2 exercise - part 2

Data visualization (part 2): Two additional Chart Types for Exploring

This assignment first shows two useful chart types: parallel coordinates and scatter matrix. You will practice these plots using a new dataset.

Parallel Coordinates with Pandas

First, we import the required libraries, using standard conventions. For the example of parallel coordinates we shall use the famous iris data set, describing the sepal and petal dimensions for three types of irises.

In [16]:
import pandas as pd
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', sep=',', low_memory=False, encoding = 'ISO-8859-1', header=None)
# column order of the UCI file: sepal length, sepal width, petal length, petal width, class
iris.columns = ['sepal length','sepal width','petal length','petal width', 'name']

iris.head()
Out[16]:
   sepal length  sepal width  petal length  petal width  name
0  5.1           3.5          1.4           0.2          Iris-setosa
1  4.9           3.0          1.4           0.2          Iris-setosa
2  4.7           3.2          1.3           0.2          Iris-setosa
3  4.6           3.1          1.5           0.2          Iris-setosa
4  5.0           3.6          1.4           0.2          Iris-setosa

Now we do not use matplotlib directly but use a plot function of the pandas library that uses matplotlib in the background. In this case we create a parallel coordinates plot.

Pandas has many plotting functions, as can be seen here: http://pandas.pydata.org/pandas-docs/stable/visualization.html#parallel-coordinates

The parallel coordinates plot can give insight into a dataset with a large number of features. For the iris set there are four features (petal width, petal length, sepal width, sepal length).

While you can make a scatter plot with four features using x, y, color and size, a parallel coordinates plot is usually easier to understand once you know how to read it. Here is the scatter plot:

In [17]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline 

fig = plt.figure()
# x and y show the petal size, color shows the sepal width, marker area grows with the sepal length
plt.scatter(iris['petal width'], iris['petal length'], c=iris['sepal width'], s=iris['sepal length']**3)
plt.xlabel('petal width [cm]')
plt.ylabel('petal length [cm]')
plt.colorbar(label='sepal width [cm]');
In [18]:
import numpy as np
from matplotlib import pyplot as plt
from pandas.plotting import parallel_coordinates
%matplotlib inline 

fig = plt.figure(figsize=[15,6])
ax = parallel_coordinates(iris,'name')
ax.set_ylabel('width/length [cm]');

Scatter Matrix with Pandas

A scatter matrix is a chart that gives you an overview of the correlations between any number of features.

In [19]:
from pandas.plotting import scatter_matrix
scatter_matrix(iris, alpha=1, figsize=(12, 12), diagonal='kde');
In [20]:
# or see what happens if we use the Seaborn library...
sns.pairplot(iris)
Out[20]:
<seaborn.axisgrid.PairGrid at 0x259ebaccd70>
In [21]:
# Seaborn provides some simple ways to explore the data and correlations in more (visual) detail...
sns.pairplot(iris, hue="name")
Out[21]:
<seaborn.axisgrid.PairGrid at 0x259eb394d40>
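
A scatter matrix shows correlations visually. As a numeric complement, the correlation coefficients can be computed directly. The sketch below uses synthetic data (an assumption for illustration, not the iris set) in which column `b` rises with `a` and column `c` falls with `a`.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' rises with 'a', 'c' falls with 'a', both with a little noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": -a + rng.normal(scale=0.1, size=200),
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

print(corr.round(2))
```

For a frame that also holds a text column, such as `name` in the iris data, the text column would have to be dropped first, for example with `iris.drop(columns='name').corr()`.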

Assignment 3

Now try to create similar plots for a new dataset about car features.

In [22]:
# The data file is quite nasty, with several different delimiters that read_csv cannot handle very well
names=['mpg','cylinders','displacement','horsepower','weight','acceleration','model year','origin','car name','j','k','l','m','n']
cars = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data', delimiter=r"\s+", names=names, header=None, engine='python')
# Create a subset of dataset with all useful features
cars = cars.iloc[:,[0,1,2,4,5,6,7]]

scatter_matrix(cars, alpha=1, figsize=(12, 12), diagonal='kde');

sns.pairplot(cars)

sns.pairplot(cars, hue='origin')
Out[22]:
<seaborn.axisgrid.PairGrid at 0x259e52c4b00>

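Create a normalized dataset

The cell below standardizes each column: it subtracts the column mean and divides by the standard deviation. Mean normalization divides by the range (maximum minus minimum) instead (see: https://en.wikipedia.org/wiki/Feature_scaling#Mean_normalization).

In [23]:
cars_norm = (cars - cars.mean())/cars.std()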
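
The cell above divides by the standard deviation, which is usually called standardization, while the linked article defines mean normalization with the range in the denominator. The following sketch puts both next to each other on made-up values (the table `toy` is hypothetical), so the difference in scale is easy to see.

```python
import pandas as pd

# Hypothetical values for two features on very different scales.
toy = pd.DataFrame({
    "weight": [2000.0, 3000.0, 4000.0],
    "acceleration": [10.0, 15.0, 20.0],
})

# Mean normalization: center on the mean, divide by the range (max - min).
mean_normalized = (toy - toy.mean()) / (toy.max() - toy.min())

# Standardization (z-score): center on the mean, divide by the standard deviation.
standardized = (toy - toy.mean()) / toy.std()

print(mean_normalized)
print(standardized)
```

Both variants center every column on zero, so features with very different units end up on comparable axes in a parallel coordinates plot.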

Next, create a parallel coordinates plot. What happens when you do not use the normalized data?

In [24]:
## Create the parallel coordinates plot here
from pandas.plotting import parallel_coordinates

fig = plt.figure(figsize=[15,6])

# one line per car, colored by origin
ax = parallel_coordinates(cars_norm,'origin', color=('blue', 'green', 'red'))
ax.set_ylabel('Standardized value');

Answer this question: What conclusions can you draw about the relation between weight and acceleration? If you don't understand how to interpret parallel coordinates plots, read: https://eagereyes.org/techniques/parallel-coordinates.

In [25]:
# In conclusion, the relation between weight and acceleration is very logical: the higher the weight, the lower the acceleration, and conversely,
# the lower the weight, the higher the acceleration.

Next, try to highlight the model years >= 80.

Hints:

  • you can slice your data with cars_norm[cars['model year']>=80].
  • you can plot both all data and the sliced data on top of each other with different colors
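
One way to read the two hints is sketched below: slice the already normalized data with a condition on the unnormalized model year, then draw the slice on top of the full data in a different color. The frame `toy_norm`, the series `model_year` and the `label` column are synthetic stand-ins introduced for this illustration; they are not the car data.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from pandas.plotting import parallel_coordinates

# Synthetic stand-in for the normalized car features (an assumption for illustration).
rng = np.random.default_rng(1)
toy_norm = pd.DataFrame(rng.normal(size=(30, 3)), columns=["f1", "f2", "f3"])
# Model years 70..82, repeated; kept outside the normalized frame like cars['model year'].
model_year = pd.Series(np.tile(np.arange(70, 83), 3)[:30])

# Label the rows once for the background and once for the highlighted slice.
background = toy_norm.assign(label="all years")
recent = toy_norm[model_year >= 80].assign(label="model year >= 80")

fig, ax = plt.subplots(figsize=(15, 6))
# Background first, so the highlighted lines are drawn on top of it.
parallel_coordinates(background, "label", color=["lightgray"], ax=ax)
parallel_coordinates(recent, "label", color=["red"], ax=ax)
ax.set_ylabel("Standardized value")
```

Because the slice is taken from the normalized frame, the highlighted lines keep the same scale as the background lines and can be compared with them directly.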
In [26]:
## Create the parallel coordinates plot here
from pandas.plotting import parallel_coordinates

# keep only model years >= 80 and standardize that subset on its own
cars_after_80 = cars[cars['model year']>=80]
cars_norm_after_80 = (cars_after_80 - cars_after_80.mean())/cars_after_80.std()

fig = plt.figure(figsize=[15,6])
# ax = parallel_coordinates(cars_norm,'origin', color=('darkblue', 'gray', 'purple'))
ax = parallel_coordinates(cars_norm_after_80,'origin',color=('blue', 'green', 'red'))
plt.title('Parallel Coordinates Plot, Model Years >= 80')
plt.xlabel('Features')
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend(loc='upper right')
plt.show()

Answer this question: what conclusions can you draw from cars with model years 80-82?

In [27]:
## Low mpg for a high number of cylinders and slightly better mpg for fewer cylinders, but the difference is not that significant. The weight-to-acceleration ratio seems a bit more
# balanced compared to all models.

Now, create a scatter matrix for the car data. Do we need to use the normalized data? Are we looking for a dataset that we can easily cluster, or will we have more luck looking for trends?

In [28]:
## Create the scatter matrix here
scatter_matrix(cars_norm, alpha=1, figsize=(12, 12), diagonal='kde');

What are your final conclusions looking at the (visual) results? What did you learn about the data and dataset? Or what new questions did you derive from the plots you've made?

In [29]:
## Final conclusion: I would look for trends rather than clusters. I learnt that there are relationships between mpg and cylinders, and between mpg and weight.